Abstract

This paper describes MTP: a reliable transport protocol that utilizes
the multicast strategy of applicable lower layer network architectures.
In addition to transporting data reliably and efficiently, MTP
provides the client synchronization necessary for agreement on the
receipt of data and the joining of the group of communicants.

Keywords: reliable transport, multicast, broadcast, atomic broadcast,
agreement.


1  Introduction 

A multicast transport is a virtual circuit connection among a set of
communicating peer-level processes. As such, any multicast transport
protocol has to satisfy somewhat conflicting goals. Being a transport
protocol, it should supply quick and reliable delivery of large amounts
of client data among the communicants. Yet, being a multicast protocol,
it should be able to supply the ordering and agreement on the delivery
of the data that is necessary for writing decentralized applications.
Agreement on order and delivery can take time, thereby slowing the
delivery of the data. Hence, most multicast protocols concentrate on a
smaller set of goals; for example, [CP88] and [CW89] concentrate on
fast delivery while [KTHB90] concentrates on the fast ordered delivery
of relatively small messages.

* Keith Marzullo is supported in part by the Defense Advanced Research
Projects Agency (DoD) under NASA Ames grant number NAG 2-593, Contract
N00140-87-C-8904. The views, opinions, and findings contained in this
report are those of the authors and should not be construed as an
official Department of Defense position, policy, or decision.

MTP, the transport described in this paper, attempts to satisfy both of
these goals. MTP is a full-duplex, flow-controlled, reliable multicast
protocol in which the data is sequenced into (perhaps long) messages.
Messages are sent within a process group called a web, where each
message has a single sender and is received by all members of the web.
The members of the web agree on the order of receipt of all messages
and can agree on the delivery of the message even in the face of
partitions¹.

MTP can be thought of as two protocols: a transport layer running
underneath an ordering and agreement layer. The transport layer is a
negative acknowledgement (or NAK) based protocol exploiting the high
probability of successful message delivery that the local area networks
of today provide [CLZ87]. Additionally, this transport utilizes the
underlying data link and physical layer’s capability to do multicast
addressing. The ordering and agreement protocol uses a sequencer site
[CM87, KTHB90] called the master that grants serialized tokens to
producers.

The  rest  of  this  paper  proceeds  as  follows.  In  Section  2,  the  class  of 
applications  for  which  MTP  is  meant  is  contrasted  with  those  applications 
other  atomic  broadcast  protocols  support.  The  protocol  is  presented  in 
Section  3.  Suggestions  for  values  of  the  protocol’s  parameters  are  derived  in 
Section  4,  and  a  discussion  of  MTP  is  given  in  Section  5. 

2  Applications 

MTP is designed to support applications that consist of a large number
of processes, where the processes send large messages and where the
application must be fault-tolerant (we consider crash failures of
processes and communication link failures that can lead to
partitioning). Examples of such applications include multimedia
teleconferencing systems, multi-screen educational systems, and stock
brokerage systems. In making this assumption, we intentionally exclude
certain classes of applications that have been considered elsewhere; in
particular, those structured as client-server systems with highly
available services (e.g., [Sch86, MS88]).

¹ A partition is the separation of a network of processes into two or
more disjoint sets that cannot communicate with each other.
One issue that MTP must address is the efficient handling of network
partitioning.  An  argument  can  be  made  that  transient  partitioning  is  a 
very  common  failure  in  the  kind  of  applications  we  are  considering  [Cri90]. 
Timeouts  are  used  to  detect  both  crash  failures  and  communication  failures. 
If  a  machine  uses  a  timeout  period  that  is  too  short,  then  it  will  appear  to 
the  machine  that  the  network  has  temporarily  partitioned.  For  CSMA/CD 
type  data  links,  there  is  no  upper  bound  on  message  delay  (communication 
and  operating  system  software  can  also  increase  the  variance  of  this  delay), 
so  such  transient  partitions  will  be  unavoidable.  The  application  designer 
must  balance  the  cost  of  recovery  from  partitioning  against  the  penalty  of 
using  excessively  long  timeouts.  Additionally,  packets  can  be  dropped  due 
to  temporary  congestion  at  both  routers  and  workstations,  again  creating 
transient  partitions. 

Our  approach  to  tolerating  partitions  is  to  choose  one  process  in  the  web 
to  be  a  distinguished  process  called  the  master.  Since  an  MTP  web  contains 
such  a  distinguished  process,  partitions  can  be  treated  in  the  same  way  as 
crash or timing failures. If the master process p_0 cannot communicate
with a member process p_i, then p_0 assumes that p_i has failed. If p_i
has instead partitioned away from p_0, p_i will know that p_0 considers
p_i to have failed and behave accordingly. The vulnerability of a web
to the failure of the master is a matter of concern, however. If the
application is to be long-lived, care must be taken in choosing the
machine that runs the master. In
Section  5.2,  we  discuss  some  techniques  for  making  a  master  more  robust. 

Most other atomic broadcast algorithms are structured in a very
decentralized manner so the failure of any (usually size-bounded)
subset of the processes will not cause the application to fail. Being
fault-tolerant in this manner is very important for implementing highly
available services, but it means that the complex issue of tolerating
partitions in a decentralized manner must be addressed [DGMS85]².

A more detailed description of the issues and uses of reliable
broadcast protocols can be found in [JB89].

² A notable example of an atomic broadcast protocol that does not have a decentralized
structure  is  described  in  [KTHB90],  although  as  presented,  this  protocol  cannot  tolerate 
partitioning. 
3  Protocol

Section 3.1 describes the overall structure of an MTP web. In Section
3.2, the ordering and agreement protocol is described assuming an
abstract transport protocol. In Section 3.3, the transport protocol is
described, and in Section 3.4 the ordering and agreement protocol is
extended to support the establishment of a web and the joining of a
member.

3.1  Web  Structure 

An  MTP  web  consists  of  a  master  process  and  a  set  of  member  processes. 
Member processes may join and leave the web, but the master process
cannot, as the web is both instantiated and terminated by the master. All
data  is  reliably  multicast:  that  is,  every  process  agrees  on  the  order  that 
a  given  message  will  be  processed,  and  the  transport  guarantees  that  any 
given  message  is  either  accepted  by  all  non-failed  processes  or  not  accepted 
by  any  non-failed  processes. 

There  are  four  transport  service  access  points  (TSAPs)  associated  with 
a  given  web: 

1.  Multicast transport addresses: These are the addresses to which all
messages  targeted  for  the  entire  web  are  transmitted.  Each  consists 
of  a  multicast  network  service  access  point  (NSAP)  catenated  with  a 
unique  transport  connection  identifier. 

2.  Master’s  transport  address:  This  is  the  TSAP  for  the  master  process. 
This  address  is  the  destination  of  messages  for  the  master  process, 
such  as  requesting  a  token  or  leaving  the  web.  This  address  is  also  the 
source  of  any  message  sent  by  the  master  process. 

3.  Join transport address: This is the NSAP for the service³ catenated
with  the  predefined  join  transport  connection  identifier.  This  address 
is  the  destination  of  all  requests  to  join  the  web. 

4.  Member transport addresses: These are the addresses of all the
processes that are currently members of the web. Each consists of the
member process NSAP catenated with a unique transport connection
identifier. The source of any packet transmitted by a process,
regardless of the packet’s destination, is a member of this set.

³ Determining this multicast NSAP for a given instantiation is not a
function performed by MTP.
3.2  Sequencing  Messages

The  agreement  and  ordering  layer  of  MTP  ensures  that  all  processes  agree  on 
which messages are accepted and in what order they are accepted. Let
p_i be a member process and M_i be the sequence of messages that p_i
has delivered to its client. The agreement layer ensures the following
two properties:

AB-1  The sequences of messages that processes have delivered to the
clients do not diverge; that is, for all processes p_i and p_j, M_i is
a prefix of M_j or M_j is a prefix of M_i.

AB-2  There exists a connected subset of the nonfaulty (i.e.,
noncrashed) processes that make progress.
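
Property AB-1 can be stated as a small executable check. The following is a minimal sketch, not part of MTP itself: the function names and the dictionary that maps a process name to its delivered message numbers are assumptions made for illustration.

```python
from itertools import combinations


def is_prefix(a, b):
    """Return True if sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and list(b[:len(a)]) == list(a)


def satisfies_ab1(delivered):
    """Check AB-1 over a mapping from process name to delivered messages.

    `delivered` is a hypothetical snapshot: each value is the list of
    message numbers one process has handed to its client so far.
    """
    return all(
        is_prefix(m_i, m_j) or is_prefix(m_j, m_i)
        for m_i, m_j in combinations(delivered.values(), 2)
    )
```

A process that lags behind (a shorter sequence) does not violate AB-1; only two processes that delivered different messages at the same position do.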

Figures 1, 2 and 3 show pseudocode for the MTP agreement and ordering
protocol. In these figures, the primitive send p x sends the message
x  to  process  p  without  blocking,  the  primitive  receive  p  x  is  a  CSP-like 
guard  [Hoa78]  that  receives  a  matching  message  from  process  p  and  stores 
it  into  x,  and  multicast  P  x  multicasts  the  message  x  to  the  processes  in 
the  set  P  without  blocking.  The  predicate  failed(s)  represents  a  timeout;  it 
will  become  true  at  some  point  after  the  processor  that  was  issued  the  token 
for  message  number  s  has  crashed  or  remained  partitioned  away  from  the 
master. 

To  send  a  message  m  to  a  web,  a  member  process  first  requests  one  of  a 
set  of  t  tokens  from  the  master  of  the  web.  This  token  contains: 

•  the  message  number  to  be  assigned  to  m, 

•  the  multicast  transport  addresses  as  discussed  in  Section  3.1, 

•  the  status  of  the  last  t  messages.  Such  messages  can  be  accepted , 
rejected ,  or  pending.  Furthermore,  the  earliest  of  these  t  messages 
must  either  be  accepted  or  be  rejected. 

The  master  sets  the  status  of  the  last  t  messages  using  the  following  rule. 
Let m be one of these last t messages:

•  if the master has seen the message m, then the status is accepted⁴;

⁴ The master has seen message m when it has received a data packet of message m
containing  an  end  of  message  indicator;  see  Section  3.3. 
process Master
begin
    members: integer set := { ... };
    status (1..t): Status := undefined .. undefined;
    t: constant integer;        the number of tokens
    next: integer := 1;
    s: integer;
    m: Data;
    last (1..t): Status;

    do receive Sender(i: 1..n) ["token.request"] and status(t - 1) ≠ pending →
        begin
            status(1..t) := pending, status(1..t - 1);
            next := next + 1;
            send Sender(i) ["token.grant", next, status, members]
        end

    [] receive Receiver(i: 1..n) ["data", s, m, last] →
        if next - s < t and status(next - s) = pending
        then status(next - s) := accepted;

    [] failure(s) and next - s < t
        and status(next - s) = pending → status(next - s) := rejected
    od
end


Figure  1:  Agreement  Protocol  for  Web  Master 

•  if  the  master  has  not  seen  the  message  m  but  the  sender  of  m  is 
still  operational  and  connected  to  the  master  (as  determined  by  the 
master),  then  the  status  is  pending ; 

•  otherwise,  the  status  is  rejected. 
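
The three-way rule above can be written as a single function. This is a minimal sketch under stated assumptions: the `Status` enumeration and the two boolean inputs are hypothetical names, and how the master decides that a sender is still connected (its timeout policy) is left outside the sketch.

```python
from enum import Enum


class Status(Enum):
    ACCEPTED = "accepted"
    PENDING = "pending"
    REJECTED = "rejected"


def message_status(seen_by_master, sender_connected):
    """Apply the master's rule for one of the last t messages.

    seen_by_master: the master received the packet carrying the end of
        message indicator for this message.
    sender_connected: the master still considers the sender operational
        and reachable (an input here; decided by timeouts in practice).
    """
    if seen_by_master:
        return Status.ACCEPTED
    if sender_connected:
        return Status.PENDING
    return Status.REJECTED
```

Note that a message the master has seen is accepted even if its sender has since failed; the sender's liveness only matters while the message is still unseen.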

An abbreviated proof of this protocol is presented in the Appendix.
Informally, the specification is met because the behavior of the web is
defined by the behavior of the master. In particular, a member process
accepts a message m only if the master accepts m, and all messages are
accepted in the order of their message sequence numbers; thus, AB-1 is
met. We define the connected subset of correct processes referred to in
AB-2 as those processes S that remain connected to the master. The
master will accept messages sent by processes in S and possibly reject
other messages, and the
process Sender(i: 1..n)
begin
    last (1..t): Status;
    members: integer set;
    s: integer;
    m: Data;

    do receive producer(i) [m] →
        begin
            send Master ["token.request"];
            receive Master ["token.grant", s, last, members];
            multicast {Master} ∪ Receiver(members) ["data", s, last, m]
        end
    od
end


Figure  2:  Agreement  Protocol  for  Web  Producer 


members of S will in turn accept and reject these same messages as
other messages are sent⁵.

Having obtained a token, the sender multicasts message m with the token
for message m included in the header of the data packets that carry m. Processes
learn  the  status  of  earlier  messages  by  seeing  such  packets,  and  can  accept 
and  reject  messages  accordingly.  This  protocol  can  tolerate  up  to  a  sequence 
of  t  failures;  if  there  are  t  +  1  failures,  then  the  master  could  send  tokens 
to  these  processes  which  could  then  fail  before  any  nonfaulty  process  sees 
any  data  sent  with  these  t  +  1  tokens.  The  headers  of  these  tokens  carry 
information  about  the  status  of  earlier  messages,  and  since  no  other  process 
received  any  data  sent  with  the  earliest  token,  the  status  of  some  message 
will  never  be  propagated  to  the  members  of  the  web. 

3.3  Sending  Messages 

The  transport  multicast  layer  of  MTP  is  implemented  using  the  multicast 
capability  provided  by  the  network  layer  (which  in  turn  is  provided  by  the 
data  link  and  physical  layers).  For  the  purposes  of  this  paper,  we  assume 

⁵ As written, a message can be acknowledged (and hence delivered) only when another
message  is  sent.  However,  the  master  can  send  empty  packets,  defined  in  Section  3.3,  in 
order  to  expedite  the  delivery  of  a  message  when  subsequent  messages  are  slow  in  being 
generated. 
process Receiver(i: 1..n)
begin
    data (1.. ): Data := empty .. ;
    status (1.. ): Status := pending .. ;
    nextIn, nextOut: integer := 1, 1;
    last (1..t): Status;
    s: integer;
    m: Data;

    do receive Receiver(j: 1..n) ["data", s, last, m] →
        begin
            k: integer := 2;
            data(s - 1) := m;
            do k < t → status(s - k) := last(k); k := k + 1 od;
            nextIn := max s, nextIn
        end

    [] receive consumer(i) and status(nextOut) = accepted →
        if data(nextOut) ≠ empty →
            send consumer(i) [data(nextOut)]; nextOut := nextOut + 1
        [] data(nextOut) = empty → rejoin
        fi

    [] status(nextOut) = rejected → nextOut := nextOut + 1
    [] status(nextOut) = pending and (nextIn - nextOut) > t → rejoin
    od
end


Figure  3:  Agreement  Protocol  for  Web  Consumer 


that  a  multicast  to  all  of  the  processes  in  a  web  can  be  accomplished  by 
performing  multicasts  to  a  small  number  of  transport  service  access  points 
(TSAPs) — no  more  than  can  be  included  in  the  data  portion  of  an  MTP 
packet.  Network  facilities  similar  to  those  described  in  [DC90]  support  this 
facility,  but  are  not  necessary  for  MTP  to  operate. 

The  transport  layer  treats  a  message  as  an  uninterpreted  sequence  of 
bytes terminated by an end of message marker. The transport layer
fragments a message into a sequence of packets. Each packet carries a
sequence number of the form (m,p) where m is the message number and p
is the packet number in this message, starting at zero. For example, if
message 5 were broken into 3 packets, then the packets would be
sequenced as (5,0), (5,1) and (5,2) (of which the last would carry an
end of message marker), and the next packet would be sequenced as (6,0).
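
The (m,p) numbering can be illustrated with a short fragmentation routine. This is a sketch only: the function name, the tuple layout, and the 1500-byte default payload size are assumptions for illustration, not a packet format defined by MTP.

```python
def fragment(message_number, payload, max_payload=1500):
    """Split one message into packets numbered (m, 0), (m, 1), ...

    Returns a list of (sequence_number, chunk, end_of_message) tuples.
    An empty message still produces one packet so that the end of
    message marker has something to ride on (an assumption here).
    """
    chunks = [
        payload[i:i + max_payload]
        for i in range(0, len(payload), max_payload)
    ] or [b""]
    last = len(chunks) - 1
    return [
        ((message_number, p), chunk, p == last)
        for p, chunk in enumerate(chunks)
    ]
```

With a 4000-byte payload for message 5, this yields the three sequence numbers of the example above, and only the packet numbered (5,2) carries the end of message marker.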

There  are  three  parameters  that  control  the  flow  of  data  in  the  transport 
layer.  They  are: 

•  heartbeat:  A  base  unit  of  time,  in  milliseconds. 

•  window:  The  maximum  number  of  data  packets  a  producer  can  send 
during  any  heartbeat. 

•  retention:  The  maximum  number  of  heartbeats  a  producer  must 
buffer  packets  for  possible  retransmissions. 

Data  is  transmitted  in  a  burst  of  packets  such  that  no  more  than  the 
current  window  of  data  packets  will  be  sent  during  a  single  heartbeat.  Every 
packet  transmitted  (including  control  packets)  always  contains  the  latest 
heartbeat,  window  and  retention  information  along  with  the  statuses  of  the 
previous  t  messages  and  the  next  message  sequence  number.  If  full  packets 
are not available⁶, empty packets (defined below) will be transmitted instead.
The  only  data  packets  that  will  be  transmitted  containing  less  than  the 
maximum  capacity  will  be  those  that  mark  a  client  state  transition. 

An empty packet is a control packet that is multicast into the web at
regular intervals whenever the producer owning a token cannot transmit
client data. Empty packets are sent to maintain synchronization and to
advertise the maximum sequence number of the producer. Empty packets
give consuming processes the opportunity to detect and request
retransmission of missed data and to identify the owner of a transmit
token.

If a producer receives a NAK from a consumer requesting the
retransmission of one or more packets, those packets will be multicast
to the entire web or to a selected subset of the multicast TSAPs. All
retransmitted packets will contain the original client information and
sequence number. However, the retransmitted packets will contain
updated parameter information (the heartbeat, window and retention). As
no more than a full window of data packets can be sent during one
heartbeat, retransmitted packets have priority over new packets during
the next heartbeat.
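
The priority rule can be sketched as a per-heartbeat selection step. The queue names and the function are hypothetical; the sketch assumes that retransmissions and new data share the same window, which is how the paragraph above reads.

```python
from collections import deque


def packets_for_heartbeat(retransmit_queue, new_queue, window):
    """Choose at most `window` data packets to send in one heartbeat.

    Pending retransmissions are drained first; new packets only use
    whatever room is left in the window.
    """
    burst = []
    while retransmit_queue and len(burst) < window:
        burst.append(retransmit_queue.popleft())
    while new_queue and len(burst) < window:
        burst.append(new_queue.popleft())
    return burst
```

Packets that do not fit stay queued for a later heartbeat, so a burst of NAKs delays new data rather than exceeding the advertised window.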

⁶ The resource being flow controlled is a packet carrying client data.
Consequently, full packets provide the greatest efficiency.

The producer is obligated to retransmit a packet upon request for at
least retention heartbeats after its original transmission (even after
the message has been completely sent). If the producer receives a NAK
from a consumer process requesting the retransmission of a packet that
is no longer available, the producer sends a NAK deny to the source of
the request. If the consumer cannot recover from the loss of this
packet, then the consumer rejoins the web to resynchronize.
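
The retention obligation and the NAK deny response can be sketched as a small buffer keyed by sequence number. The class, its method names, and the returned tuples are assumptions for illustration; time is counted in whole heartbeats.

```python
class RetentionBuffer:
    """Hold sent packets for `retention` heartbeats (illustrative only)."""

    def __init__(self, retention):
        self.retention = retention
        self.sent = {}  # (m, p) -> (heartbeat when first sent, data)

    def record(self, seq, data, heartbeat):
        """Remember a packet at the heartbeat of its original transmission."""
        self.sent[seq] = (heartbeat, data)

    def expire(self, now):
        """Drop packets older than `retention` heartbeats."""
        self.sent = {
            seq: (hb, data)
            for seq, (hb, data) in self.sent.items()
            if now - hb <= self.retention
        }

    def handle_nak(self, seq, now):
        """Return a retransmission if still buffered, else a NAK deny."""
        self.expire(now)
        if seq in self.sent:
            return ("retransmit", self.sent[seq][1])
        return ("nak_deny", None)
```

A consumer that receives the deny has no further recourse within the transport and falls back to rejoining the web.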

Figure  4  shows  a  space-time  diagram  of  a  process  transmitting  into  a 
web  assuming  no  NAKs,  and  Figure  5  illustrates  data  transmission  and 
NAK  processing. 

3.4  Consistency  and  Joining  the  Web 

A process p_i may become unrecoverably inconsistent with the master of
the web for several reasons. The most likely reason is that p_i has
partitioned away from the master long enough that p_i missed learning
the status of a message. A less likely scenario is that some process
p_j transmits a message that is received by the master but not by p_i,
and p_j crashes before p_i can ask for retransmission of the missed
packets. In any case, when a process finds itself inconsistent with the
master, it can resynchronize itself by rejoining the web.

As  described  in  Section  3.1,  the  master  of  a  web  constructs  the  master 
transport  address  by  catenating  the  NSAP  with  a  locally  generated  unique 
transport connection identifier. A process that wishes to join or
rejoin the web will send a join request message to the join transport
address, and the master will answer with a join confirm message whose
source is the master transport address. Note that a rejoining process
can determine whether the web is the same session with which it became
inconsistent by comparing the previous and new transport connection
identifiers it obtained in the join confirm messages.

In  general,  a  process  that  repeatedly  receives  no  join  confirm  cannot 
elect  itself  the  master.  Another  process  may  follow  the  same  reasoning  in 
another  partition,  and  then  if  the  partition  were  to  end,  there  would  be  two 
inconsistent  webs  with  undesirable  properties;  for  example,  a  third  joining 
process  would  nondeterministically  join  one  of  the  two  existing  webs.  Any 
“merging”  of  such  inconsistent  webs  would  have  to  be  done  outside  of  MTP, 
as  the  semantics  of  such  a  merge  would  depend  on  the  application.  A  better 
method  for  master  selection  would  be  for  a  process  to  know  a  priori  if  it 
were  the  master  or  not.  Doing  so  would  both  guarantee  that  there  exists 
only one active web with a given NSAP and would allow the master to be
located  on  a  machine  that  is  known  to  be  reliably  available. 

Having joined a web, a process p must be informed which message it
should first accept. If p does not need to be given any state in order
to process the next message, then the master can immediately reply to
the join request message with a join confirm message containing the
sequence number s of the next token the master will hand out. Then, the
joining process p need only start receiving messages with sequence
numbers greater than or equal to s. However, for some applications p
would need to be initialized with the state of the web after all
messages before s have been accepted or rejected. In this case, having
received a join request, the master will stop granting token requests
and will delay sending a join confirm message to p until all messages
before s have been accepted or rejected. Then, the master can respond
with the join confirm, p's state can be initialized (either by having
the master send the state or through a protocol outside of MTP), and
the master can resume granting tokens.

Figure 6 shows a space-time diagram illustrating the sequence of
messages during a join with a transfer of state from the master.

4  Parameter  Values 

The values of heartbeat, window and retention can be adjusted by the
transport to reflect the capability of the members, the type of application being
supported  and  the  network  topology.  In  general,  the  producers  will  try  to 
drive  these  numbers  towards  a  higher  performance  level,  and  the  consumers 
will  try  to  drive  these  numbers  towards  a  higher  reliability  level.  By  doing 
so,  both  are  trying  to  optimize  the  quality  of  service. 

Producers  can  try  to  improve  the  performance  by  reducing  the  heartbeat 
interval  and  by  increasing  the  window  size.  This  will  have  the  effect  of 
increasing  the  resources  committed  to  the  transport  at  any  time.  To  level 
the  resource  commitment,  the  producer  may  also  reduce  the  retention.  In 
the worst case, a producer must commit enough storage to hold window ×
retention maximum-size packets for heartbeat × retention milliseconds.

Consumers  must  rely  on  their  clients  to  consume  the  data  occupying  the 
resources  of  the  transport.  The  consumer  transport  implementation  must 
monitor  the  level  of  committed  resources  in  order  to  ensure  that  resources 
are  not  overcommitted.  Since  MTP  is  a  NAK-based  protocol,  the  consumer 
is  required  to  inform  a  producer  if  a  change  in  parameters  is  required.  A 
consumer must be capable of committing at least t times the memory
committed by a producer.

For more reliable operation, a consumer would try to extend the
heartbeat interval and increase retention. This has the effect of
increasing the resources needed to support the transport. To counteract
this, the consumer could reduce the window.

In  order  to  make  these  parameters  more  concrete,  consider  MTP  running 
on  a  collection  of  1-MIP  workstations  with  local  industry-standard  disks, 
communicating over an IEEE 802.3 local area network. The heartbeat is
approximately  the  transport  time  constant.  Assuming  that  the  transport 
can  be  modeled  as  a  closed  loop  function,  reaction  to  feedback  into  the 
transport  should  settle  out  in  three  time  constants.  In  a  transport  that  is 
constrained  to  a  single  network,  the  dominant  cause  of  processing  delay  will 
most  likely  be  the  page  fault  resolution  time.  The  time  to  service  a  page 
fault is overwhelmingly the disk access time, and for the current
industry-standard disks, around 40 milliseconds is the average worst-case access time.
In  the  worst  case,  this  time  could  double  in  order  to  reclaim  a  dirty  page. 
Allowing  for  additional  overhead  and  scheduling  delays,  two  times  the  worst 
case  page  fault  resolution  time  should  be  a  suitable  minimum  transport  time 
constant,  which  is  160  milliseconds. 

The  window  is  the  number  of  packets  that  can  be  consumed  during 
one  heartbeat.  For  IEEE  802.3  local  area  networks,  the  transmit  time  per 
packet  is  1.2  milliseconds  for  a  full  packet  of  1500  bytes.  The  processing 
time  on  a  1-MIP  machine  running  Unix  should  be  around  5  milliseconds  for 
a  full  packet  (where  2.5  -  3  ms  of  this  is  incurred  by  the  operating  system). 
Assuming  that  the  data  for  the  packet  originated  from  a  disk  backing  store 
and  that  disk  service  overhead  is  comparable  to  network  service  overhead, 
the  resulting  overhead  is  11.2  milliseconds  per  packet,  corresponding  to  a 
bandwidth  of  1  Mbit/sec.  During  a  heartbeat  of  160  milliseconds  14  packets 
can  be  sent,  so  the  maximum  window  would  be  approximately  14  packets 
per  heartbeat. 
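
The window estimate above can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: the 11.2 millisecond figure is read as 1.2 ms of transmit time plus 5 ms of network processing plus a comparable 5 ms of disk service, which is how the paragraph's numbers add up. The function name is hypothetical.

```python
def max_window(heartbeat_ms, transmit_ms, processing_ms, disk_ms):
    """Return (packets per heartbeat, per-packet cost in milliseconds)."""
    per_packet_ms = transmit_ms + processing_ms + disk_ms
    return int(heartbeat_ms // per_packet_ms), per_packet_ms


# Figures taken from the text; the 5 ms disk cost is the assumed
# "comparable to network service overhead" term.
window, per_packet_ms = max_window(160, 1.2, 5.0, 5.0)

# Effective bandwidth for full 1500-byte packets, in Mbit/sec.
bandwidth_mbit = 1500 * 8 / (per_packet_ms / 1000) / 1e6
```

Under these inputs the window comes out at 14 packets and the bandwidth at slightly more than 1 Mbit/sec, matching the rounded values quoted above.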

At worst, each producer could consume 10 percent of the available
network bandwidth, so MTP will not be limited by the network bandwidth.
Each producer consumes about 80 percent of the consumer’s processing
time, so having more than one producer outstanding could saturate a
consumer. However, to a point, having multiple tokens allows some producers
to  acquire  a  token  shortly  before  it  is  required  (presumably  overlapping  the 
transmission  of  an  earlier  message)  without  locking  out  another  producer. 
Additionally,  increasing  t  decreases  the  average  message  delivery  time  (until 
thrashing  becomes  a  problem).  Since  the  peak  resource  requirement  scales 
linearly  with  t,  a  reasonable  value  of  t  would  probably  be  two  or  three. 

Reducing retention may introduce instability because a consumer will
have less opportunity to react to missing data. Data can be missed for
a variety of reasons. If constrained to the local net, the data lost
due to corruption should be around one packet in 50,000⁷. Four orders of magnitude
more  packets  are  lost  at  receiving  stations,  including  packet  switch  routers, 
than  over  physical  links.  The  losses  are  usually  the  result  of  congestion  and 
resource  starvation  at  lower  layers  due  to  the  processing  of  (nearly)  back  to 
back  packets.  One  can  only  require  that  a  receiving  station  be  capable  of 
receiving  some  number  of  back  to  back  packets  successfully,  and  that  number 
must  be  at  least  greater  than  the  window  size.  The  probability  of  success 
can  be  made  as  high  as  needed  by  providing  the  receiver  the  opportunity  to 
observe  the  data  multiple  times. 

At  worst,  the  receiving  station  detects  packet  loss  using  timers.  Such 
timers  might  have  a  granularity  of  more  than  two  orders  of  magnitude 
greater  than  the  maximum  packet  transmit  time.  As  such,  the  worst  case 
is  much  worse  than  detecting  data  loss  due  to  gaps  in  sequence  numbers. 
When  the  loss  is  detected,  the  response  (a  NAK)  is  transmitted  and  should 
be  received  at  the  producing  process  in  less  than  two  heartbeats  after  the 
data it references was transmitted. Again, it is the detection time
that dominates, not the transmission of the NAK. NAKs are also subject to loss, but
the  probability  of  delivery  can  be  made  close  to  one  by  retransmitting.  In 
order  to  be  able  to  respond  to  a  second  NAK,  the  minimum  retention  is 
three. 

The  resources  committed  to  a  transport  using  the  above  assumptions  are 
buffers  sufficient  for  126  packets  of  1500  bytes  each,  and  each  buffer  will  be 
committed  for  at  least  480  milliseconds. 
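
These totals follow from the parameter values derived in this section. The sketch below assumes a window of 14, a retention of 3, a heartbeat of 160 ms, and t = 3 tokens; the choice of t = 3 is an inference, since 14 × 3 × 3 is the product that gives 126. The function and key names are hypothetical.

```python
def committed_resources(window, retention, heartbeat_ms, tokens,
                        packet_bytes=1500):
    """Worst-case buffer commitment implied by the transport parameters."""
    per_producer = window * retention          # packets one producer holds
    consumer_packets = tokens * per_producer   # a consumer holds t times that
    return {
        "producer_packets": per_producer,
        "consumer_packets": consumer_packets,
        "consumer_bytes": consumer_packets * packet_bytes,
        "hold_ms": heartbeat_ms * retention,
    }


resources = committed_resources(window=14, retention=3,
                                heartbeat_ms=160, tokens=3)
```

With these inputs a consumer needs buffers for 126 packets (189,000 bytes), each held for 480 milliseconds.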

The parameters would be very different for a web that spans an internetwork
of several LANs, and could be adjusted to accommodate the properties
of the network. For example, if a producer is separated from a set of consumers
by a router and the router drops a packet due to congestion, then all
of the consumers will simultaneously send NAKs, further aggravating the
congestion. To avoid this burst of NAKs, the master could have previously
set the web's retention to $l + 3$ for some positive value of $l$. Each NAKing
consumer would then dally for some number of heartbeats between 0 and
$l$ before NAKing a missed packet. Not only would this dallying reduce the
number of simultaneous NAKs by a factor of $l$, but most processes would
probably receive the retransmission without sending a NAK.
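
The effect of dallying can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the protocol: each consumer picks its dally uniformly at random from 0 to l heartbeats (the text does not say how the dally is chosen), and the retransmission is assumed to reach every consumer `retransmit_delay` heartbeats after the first NAK, suppressing any NAK scheduled later. All function names are hypothetical.

```python
import random
from collections import Counter


def nak_schedule(consumers: int, l: int, seed: int = 0) -> Counter:
    """Map heartbeat offset -> number of consumers whose dally ends there.

    ASSUMPTION: dally times are uniform over 0..l inclusive.
    """
    rng = random.Random(seed)
    return Counter(rng.randint(0, l) for _ in range(consumers))


def naks_sent(consumers: int, l: int, retransmit_delay: int = 1, seed: int = 0) -> int:
    """Count the NAKs actually sent before the retransmission arrives.

    ASSUMPTION: the retransmission triggered by the earliest NAK reaches all
    consumers `retransmit_delay` heartbeats later and cancels pending NAKs.
    """
    schedule = nak_schedule(consumers, l, seed)
    if not schedule:
        return 0
    first = min(schedule)  # earliest heartbeat at which any NAK is sent
    return sum(n for beat, n in schedule.items() if beat < first + retransmit_delay)


if __name__ == "__main__":
    print(naks_sent(100, 0))  # no dallying: every consumer NAKs at once
    print(naks_sent(100, 9))  # dallying spreads the NAKs over ten heartbeats
```

With no dallying every consumer NAKs in the same heartbeat; with dallying only the consumers that happen to draw the earliest slot send a NAK, which matches the qualitative argument above.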

telephone links (between routers, for example) are capable of exhibiting similar
corruption rates.

5  Discussion

5.1  Number  of  Tokens 

In  Section  4,  it  was  argued  that  a  reasonable  number  of  tokens  would  be 
around  two  or  three.  It  isn't  clear  what  the  number  of  tokens  should  be 
when  a  web  spans  a  larger  collection  of  networks.  On  one  hand,  having  more 
tokens  allows  more  processors  to  pre-allocate  tokens,  thereby  overlapping  the 
longer  round-trip  message  time  with  (hopefully)  other  processing.  On  the 
other  hand,  the  maximum  number  of  buffers  increases  with  the  number  of 
tokens,  and  processors  distant  from  the  master  are  more  likely  to  partition 
away  from  the  master,  thereby  increasing  the  number  of  failures. 

One can allow the master to find a balance by varying the number of
tokens. This is done by logically splitting $t$ into two values: $t_{max}$, which is
the maximum number of tokens that can be outstanding and is the number of
message statuses carried in a header, and $t_{cur}$, which is the current maximum
number of tokens that can be outstanding and need be known only by the
master. The number of failures that can be tolerated is determined by
$t_{max}$ (see the discussion in the Appendix). The master could then vary $t_{cur}$
between 1 and $t_{max}$ depending on the web performance.
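
One way a master might track the two values is sketched below. The text only states that the current limit may vary between 1 and the maximum; the adjustment policy here (raise the limit when a token request had to wait, lower it when a token is lost) is purely an assumption for illustration, and the class and method names are hypothetical.

```python
class TokenLimit:
    """Tracks the current token limit t_cur within [1, t_max].

    ASSUMPTION: the raise/lower policy is illustrative; the paper does not
    specify how the master should react to web performance.
    """

    def __init__(self, t_max: int) -> None:
        if t_max < 1:
            raise ValueError("t_max must be at least 1")
        self.t_max = t_max  # fixed: number of statuses carried in a header
        self.t_cur = 1      # known only to the master

    def on_request_waited(self) -> None:
        """A token request had to queue: allow one more outstanding token."""
        self.t_cur = min(self.t_max, self.t_cur + 1)

    def on_token_lost(self) -> None:
        """A granted token was lost: allow one fewer outstanding token."""
        self.t_cur = max(1, self.t_cur - 1)

    def may_grant(self, outstanding: int) -> bool:
        """True when another token may be granted right now."""
        return outstanding < self.t_cur


if __name__ == "__main__":
    limit = TokenLimit(t_max=3)
    limit.on_request_waited()
    print(limit.t_cur, limit.may_grant(outstanding=1))
```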

5.2  Resiliency  Against  Failure  of  the  Master 

The  main  vulnerability  of  MTP  is  that  the  failure  of  the  master  can  cause 
the web to fail. For some applications (e.g., a stock brokerage system),
such  a  failure  could  be  intolerable.  In  this  case,  it  would  become  desirable 
to  replicate  the  web  master.  Replicating  the  master  for  high  tolerance  to 
processor  failure  can  be  done  without  changing  MTP,  but  having  a  replicated 
master  would  be  noticed  by  the  members  as  an  increase  in  the  response  time 
to  a  token  request  (and  less  importantly,  to  a  join  request). 

All the master replicas $p_0^1, p_0^2, \ldots, p_0^k$ would reside on an unpartitionable
network (for example, a single local area network), guaranteeing that if a
member $p_1$ is connected with $p_0^i$ and member $p_2$ is connected with $p_0^j$, then
$p_1$ is connected with $p_0^j$ and $p_2$ is connected with $p_0^i$. The web's master
TSAP would be a multicast address for these replicated masters.

The masters would choose one amongst themselves to be the coordinator,
with the rest being cohorts [BJ87]. Any replica receiving a request would
atomically broadcast the request to all the master replicas before the
coordinator would respond. Similarly, when the coordinator decides that a
message becomes accepted, the coordinator would first atomically broadcast
this fact to all the master replicas⁸. If the coordinator were then to
fail, one cohort would become the new coordinator. This new coordinator
would reject all messages that it considered pending and start responding to
master requests.

5.3  Web  Membership 

One  issue  we  have  not  discussed  in  this  paper  is  how  a  process  can  determine 
the  current  membership  of  a  web.  Knowing  this  information  can  be  very 
useful;  for  example,  if  all  the  processes  agree  on  the  current  web  membership, 
then  each  can  agree  a  priori  on  how  work  should  be  partitioned  amongst 
themselves.  The  group  membership  problem  is  essentially  that  of  having  the 
web  members  agree  on  when  a  process  joins  the  web  and  when  a  process 
leaves  the  web  (either  by  failing,  by  partitioning  away,  or  under  its  own 
volition)  [Cri88,Ric90].  The  difficulty  with  the  group  membership  problem 
is  that  it  really  cannot  be  “solved”;  since  a  process  can  fail  without  notifying 
any  other  process,  a  member  of  a  web  cannot  be  sure  whether  or  not  another 
process  is  currently  a  member.  The  best  that  can  be  done  is  to  have  the 
web  members  agree  on  the  membership  of  the  web,  and  accept  the  fact  that 
there  may  be  members  that  have  crashed,  and  that  there  may  be  processes 
that,  due  to  the  asynchronism  in  the  system,  have  been  excluded  from  the 
web even though they have not crashed or partitioned away⁹.

Group  membership  protocols  operate  by  having  processes  monitor  each 
other. If a process p' decides that another process p has failed, then p' uses
some reliable broadcast protocol to disseminate this information to the other
web members [Ric90]. A common method of detecting whether a process
p  has  failed  or  not  is  to  use  low-level  “alive”  messages:  other  processes 
periodically  expect  such  messages  from  p  (perhaps  as  the  result  of  periodic 

⁸As stated in this paper, the only time a member learns the status of a message is
when it receives a token or a data message from another member. So, if the coordinator
were to notify the cohorts before granting a token, then the cohorts would be consistent.
However, in the actual protocol the master may send periodic empty packets to expedite
the delivery of messages. If this empty packet advertises a new status, then the coordinator
must inform the cohorts.

⁹Web members must be careful in the deductions they make from the purported group
membership. For example, even if a process p was a member of a web through the
delivery of some message m, other web members cannot assume that p actually processed
any message ordered before m unless p specifically acknowledged this fact. To do
otherwise would be assuming a solution exists to the coordinated attack problem, which is
unsolvable [Gra79].

requests), and assume that p has failed if such messages cease to arrive.
Once all web members agree that p has failed (even if it has not), the new
web membership is defined.

Since MTP is a NAK-based protocol, there is no defined low-level “alive”
protocol. A web membership protocol, however, can be implemented on top
of MTP as part of the application protocol. Each web member maintains
a set that contains the current web membership. When a process p joins a
web, p multicasts this fact to the web, and all web members (including p)
add p to their membership set when they receive this message. Similarly, if
a process p' decides, for any reason, that another process p has failed, then
p' multicasts this fact to the web. If p' is still a member of the web when
this message is delivered, then each process (including p') removes p from
its membership set when it receives this message.
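
The membership rules just described can be sketched as a small state machine. The sketch assumes, as the protocol is meant to guarantee, that every member sees the same messages in the same order; the message tuples and the names `Member`, `deliver`, and `replay` are hypothetical and not part of MTP.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One web member's view of the membership set (hypothetical sketch)."""

    name: str
    members: set = field(default_factory=set)

    def deliver(self, kind: str, sender: str, subject: str) -> None:
        """Apply one delivered membership message.

        "join":   `subject` announces that it has joined the web.
        "failed": `sender` claims that `subject` has failed; the claim is
                  honoured only if `sender` is still a member on delivery.
        """
        if kind == "join":
            self.members.add(subject)
        elif kind == "failed" and sender in self.members:
            self.members.discard(subject)


def replay(log, names):
    """Deliver the same ordered log to every named process."""
    procs = [Member(n) for n in names]
    for kind, sender, subject in log:
        for proc in procs:
            proc.deliver(kind, sender, subject)
    return procs


if __name__ == "__main__":
    log = [
        ("join", "a", "a"),
        ("join", "b", "b"),
        ("join", "c", "c"),
        ("failed", "b", "c"),  # b reports c as failed: c is removed
        ("failed", "c", "a"),  # c is no longer a member: the report is ignored
    ]
    for proc in replay(log, ["a", "b"]):
        print(proc.name, sorted(proc.members))
```

Because every process applies the same ordered log, all of them end with the same membership set, which is the agreement property the text relies on.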

Such  membership  information  is  of  interest  to  the  master.  As  discussed 
in  Section  3.1,  the  master  includes  a  list  of  multicast  TSAPs  in  a  token 
grant  message.  This  list  of  TSAPs  covers  the  membership  of  the  web  as 
known  by  the  master,  which  as  currently  presented  may  not  be  the  same 
as  the  membership  set  described  above.  The  solution  in  MTP  is  to  allow 
a producer and receiver to execute with the master. These processes can
exchange the membership changes each observes: the master seeing token losses
and the receiver seeing member-observed failures. By doing so, the master
can  remove  a  multicast  TSAP  from  its  list  when  all  processes  reached  via 
that  TSAP  have  left  the  web,  and  the  producer  can  multicast  the  removal 
of  a  member  process  when  that  member  loses  a  token. 

5.4  Conclusions 

MTP is a multicast transport that supports the strong conditions of agreement
on delivery, agreement on order, and agreement on web membership.
An implementation of MTP is currently under way.

Appendix:  Specification  and  proof

This appendix presents a specification and a proof of the ordering and agreement
protocol. In the interest of brevity, the proof is somewhat informal and
incomplete; in particular, several simple lemmas are stated and used without
proof.

Let $p_0$ be the master process and $p_1$ through $p_n$ be the member processes.
The sequence of messages that $p_i$ has delivered to its client is denoted as $M_i$,
and we write $M_i \sim M_j$ to mean that $M_i$ is a prefix of $M_j$ or $M_j$ is a prefix
of $M_i$. Similarly, we will denote by $A_i$ the messages that $p_i$ has marked as
accepted and $R_i$ the messages that $p_i$ has marked as rejected. Both $R_0$ and
$A_0$ are defined, but as there is no client of the master, $M_0$ is not defined.
The sequence number of a message sent with the statement multicast ...
["data", s, last, m] is s - 1, which we will denote as m.seq¹⁰. We will write
$m_1 < m_2$ as shorthand for $m_1.seq < m_2.seq \land m_1 \in A_0 \land m_2 \in A_0$.

The subset of processes that are not faulty is denoted as $C$. The state
predicate $conn(p_i, p_j)$ is true when $p_i$ and $p_j$ are connected, the state
predicate $send(m, p_i)$ is true when $p_i$ sends message $m$, the state predicates
$produce(i)$ and $consume(i)$ are true when the client on $p_i$ requests a message
to be sent and requests data respectively, and the state function $S$ is a subset
of the processes $p_1, p_2, \ldots, p_n$.

The specification consists of two properties. The first is a safety property,
which specifies that “bad” states do not occur, while the second is a liveness
property, which specifies that “good” states will eventually occur.

AB-1 The sequences of messages delivered to the clients do not diverge:

$$\Box\,(\forall p_i, p_j:\ M_i \sim M_j)$$
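
The prefix relation used in AB-1 is easy to state operationally. The following sketch, with hypothetical function names, checks whether two delivery sequences are compatible and whether a whole set of delivery sequences satisfies AB-1 in a given state; it models delivered messages as plain lists and is not taken from the protocol itself.

```python
from itertools import combinations


def compatible(mi, mj) -> bool:
    """True when one delivery sequence is a prefix of the other (M_i ~ M_j)."""
    shorter, longer = sorted((mi, mj), key=len)
    return list(longer[: len(shorter)]) == list(shorter)


def ab1_holds(deliveries) -> bool:
    """True when every pair of delivery sequences is compatible."""
    return all(compatible(a, b) for a, b in combinations(deliveries, 2))


if __name__ == "__main__":
    print(compatible([1, 2], [1, 2, 3]))      # one is a prefix of the other
    print(compatible([1, 3], [1, 2, 3]))      # the sequences diverge
    print(ab1_holds([[1], [1, 2], [1, 2, 4]]))
```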

AB-2  There  exists  a  connected  subset  S  of  the  correct  processes  C  that 
make  progress: 

$$\Box\,(\forall m, p_i, p_j:\ p_i, p_j \in C:\ send(m, p_i) \land \Box\,(p_i, p_j \in S) \Rightarrow \Diamond\, m \in M_j)$$

¹⁰Formally, any reference to $m$ is actually a reference to $m.seq$. The values of $R_i$ and
$A_i$ for $i > 0$ are state functions whose values are defined by the arrays Producer.status
and Producer.data: if there exists a state in which process $p_i$ has status[k] = accepted and
data[k] ≠ empty, then in that state $\forall m:\ m.seq = k:\ m \in A_i$, and if there exists a state
in which process $p_i$ has status[k] = rejected, then in that state $\forall m:\ m.seq = k:\ m \in R_i$.
We can then define $m \in M_i$ as $m \in A_i \land nextOut > m.seq$. Similarly, the values of $R_0$
and $A_0$ are defined by the array Master.status: if in some state status[k] = accepted, then
henceforth $\forall m:\ m.seq = next - k:\ m \in A_0$, and if in some state status[k] = rejected then
henceforth $\forall m:\ m.seq = next - k:\ m \in R_0$.

Our  assumptions  are: 

• All $M_i$, $R_i$, and $A_i$ are initially empty;

• the master never fails: $p_0 \in C$;

• $conn(p_i, p_j)$ is an equivalence relation (i.e., it is symmetric and transitive);

• unbounded fairness is followed in the selection of enabled guards, i.e.,
a guard that remains true will eventually be selected;

• clients on correct processors always continue to send messages and
consume messages:

$$\Box\,(\forall p_i:\ p_i \in C:\ \Diamond\, produce(i) \land \Diamond\, consume(i))$$

Additionally, we will assume without proof that the protocol satisfies the
following three lemmas:

1. The delivery of a message is monotonic:

$$L_1:\ \Box\,(\forall m, p_i:\ (m \in M_i) \Rightarrow \Box\,(m \in M_i))$$

2. A process cannot both accept and reject the same message:

$$L_2:\ \Box\,(\forall m, p_i:\ (m \in A_i) \Rightarrow (m \notin R_i))$$

3. Clients receive messages in message sequence number order:

$$L_3:\ \Box\,(\forall m_1, p_i:\ m_1 \in M_i \Rightarrow (\forall m_2:\ m_2.seq < m_1.seq:\ m_2 \in M_i \lor m_2 \in R_i))$$

Showing Safety  One can show a program satisfies a safety property $P$ by
finding a property $I$ such that the initial conditions $Init$ imply $I$, $I$ implies
$\Box I$, and $I$ implies $P$. For $I$, we will use the conjunction of the two predicates
$I_1$ and $I_2$:

$$I_1:\ \forall m, p_i:\ (m \in R_i) \Rightarrow (m \in R_0)$$

$$I_2:\ \forall m_1, m_2, p_i:\ (m_1 < m_2 \land m_1 \notin M_i) \Rightarrow m_2 \notin M_i$$

Initially, all $M_i$, $R_i$, and $A_i$ are empty, making the antecedents of $I_1$ and $I_2$ both
false; thus, $Init \Rightarrow I$. To show that $I_1$ and $I_2$ imply AB-1, note that together
they state that for all $p_i$, $M_i$ is a prefix of $M_0$. Since $M_i$ is a prefix of $M_0$
and $M_j$ is a prefix of $M_0$, at least one of $(M_i, M_j)$ is a prefix of the other,
meaning $M_i \sim M_j$.

We now prove $I_1 \Rightarrow \Box I_1$. By $L_1$, $I_1$ can become false only if a member
$p_i$ rejects a message $m$ before $p_0$ rejects $m$. For $p_i$ to reject $m$, it must have received a
data message from a member process $p_j$ containing values of s and last such
that last(s - m.seq) = rejected. To send such a data message, $p_j$ must have
received from $p_0$ a token grant message containing the same values of s and
last. By the definition of $R_0$, $m \in R_0$. Thus, $I_1 \Rightarrow \Box I_1$.

We now prove $I_2 \Rightarrow \Box I_2$. $I_2$ can become false only if the expression
$\exists m_1, m_2, p_i:\ (m_1 < m_2 \land m_1 \notin M_i) \land m_2 \in M_i$ becomes true for some messages
$m_1$ and $m_2$ and member process $p_i$; that is, $p_i$ delivers a message $m_2$ to its
client but has not (yet) delivered $m_1$ to its client, where $m_1$ and $m_2$ have
both been accepted by $p_0$ and $m_1.seq < m_2.seq$. By $L_3$, we know that
$m_1 \in M_i \lor m_1 \in R_i$, and since by assumption $m_1 \notin M_i$, we know that
$m_1 \in R_i$. By $I_1$, we know that $m_1 \in R_0$, but since $m_1 < m_2$ we know that
$m_1 \in A_0$. This is a contradiction (it violates $L_2$), so $I_2 \Rightarrow \Box I_2$.

Showing Liveness  To show liveness, we will first assume that for the
master $\Box\,(t > next)$, which implies $\Box\,(status(t - 1) = null)$. We will then
show the effects when $t$ is assumed to have a more reasonable value.

Property AB-2 is expressed in terms of the set of processes $S$; we will
define this set as $p_i \in S \equiv conn(p_0, p_i)$. Rewriting, we get


$$I_3:\ \Box\,(\forall m, p_i, p_j:\ p_i, p_j \in C:\ send(m, p_i) \land \Box\,(conn(p_0, p_i) \land conn(p_0, p_j)) \Rightarrow \Diamond\, m \in M_j)$$

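To show $I_3$, we will need the following five liveness properties, of which
$I_4$, $I_5$ and $I_8$ imply $I_3$:

$$I_4:\ \Box\,(\forall m, p_i:\ p_i \in C:\ send(m, p_i) \land \Box\,(conn(p_0, p_i)) \Rightarrow \Diamond\, m \in A_0)$$

$$I_5:\ \Box\,(\forall m, p_j:\ p_j \in C:\ m \in A_0 \land \Box\,(conn(p_0, p_j)) \Rightarrow \Diamond\, m \in A_j)$$

$$I_6:\ \Box\,(\forall m, p_j:\ p_j \in C:\ m \in R_0 \land \Box\,(conn(p_0, p_j)) \Rightarrow \Diamond\, m \in R_j)$$

$$I_7:\ \Box\,(\forall m:\ \Diamond\,(m \in A_0 \lor m \in R_0))$$

$$I_8:\ \Box\,(\forall m, p_j:\ p_j \in C:\ m \in A_j \land \Box\,(conn(p_0, p_j)) \Rightarrow \Diamond\, m \in M_j)$$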

For brevity, only an informal proof for $I_5$ will be shown. If there are
no $p_j$ that satisfy the antecedent of $I_5$, then the lemma is vacuously true, so we will assume
that there is at least one such $p_j$, say $p_k$. By assumption, the producer
on $p_k$ will eventually request a message to be sent, and by finite progress
$p_k$ will eventually send a message to $p_0$ requesting a token. By fairness and
connectivity, $p_0$ will eventually select the guard (status(t - 1) = null). By the
definition of $A_0$, $m \in A_0 \Rightarrow$ status(next - m.seq) = accepted, which is passed
back to $p_k$ in the token grant message (again by fairness and connectivity).

By finite progress, $p_k$ will send a message containing the value of last,
and since connectivity is an equivalence, any $p_j$ is connected to $p_k$, and will
therefore receive this message. Then, by finite progress $p_j$ will eventually
set $m \in A_j$, and the lemma holds.

The effect of letting $t$ be smaller than the maximum sequence number
is that a nonfaulty process $p_j$ that is connected to $p_0$ may not satisfy AB-2;
in particular, $I_5$ and $I_8$ may not hold. A sequence of $l > t$ token requests by
processes that appear to fail after having been granted a token will generate
a sequence of $l$ rejected messages. However, when $p_j$ receives message $m$, it
only sets the status for messages $m'$: $m'.seq > m.seq - t$, so there will be some
message whose status will remain pending. Eventually, nextIn - nextOut
will be greater than $t$ and nextOut will point to the pending message, forcing
$p_j$ to rejoin. Thus, the algorithm is live only if there are no sequences of
rejected messages with a length of $l > t$.
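
The closing condition can be checked mechanically on a trace of master decisions. The sketch below is illustrative only: it models the master's decisions as a list of status strings in sequence-number order, and the function names are hypothetical.

```python
from itertools import groupby


def longest_rejected_run(statuses) -> int:
    """Length of the longest run of consecutive "rejected" entries."""
    return max(
        (sum(1 for _ in group) for status, group in groupby(statuses) if status == "rejected"),
        default=0,
    )


def stays_live(statuses, t: int) -> bool:
    """Condition from the text: no run of rejected messages longer than t."""
    return longest_rejected_run(statuses) <= t


if __name__ == "__main__":
    trace = ["accepted", "rejected", "rejected", "accepted", "rejected"]
    print(longest_rejected_run(trace))  # 2
    print(stays_live(trace, t=2))       # True
    print(stays_live(trace, t=1))       # False
```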
[Figure 4: Normal Data Transmission. Timeline of packets data(n) through data(n + 2w - 1) with window w = 3, retention r = 2, and heartbeat h. Annotations: on empty(n + 2w), n..n + w - 1 can be released; on data(n + 2w) with eom, n + 2..n + w + 2 can be released.]

[Figure 5: NAKs and Retransmission. Retention r = 2, heartbeat h.]

[Figure 6: Joining and State Transfer.]


